Problem Statement¶
Context¶
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
Objective¶
To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.
Data Dictionary¶
ID: Customer IDAge: Customer’s age in completed yearsExperience: #years of professional experienceIncome: Annual income of the customer (in thousand dollars)ZIP Code: Home Address ZIP code.Family: the Family size of the customerCCAvg: Average spending on credit cards per month (in thousand dollars)Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/ProfessionalMortgage: Value of house mortgage if any. (in thousand dollars)Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)Online: Do customers use internet banking facilities? (0: No, 1: Yes)CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)
Importing necessary libraries¶
# Installing the libraries with the specified version.
!pip3 install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Importing scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier # Example for classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
Loading the dataset¶
data = pd.read_csv('Loan_Modelling.csv');
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
Data Overview¶
- Observations
- Sanity checks
df = data.copy()
print(f"shape of data: {data.shape}")
print(f"{data.info()}")
print(f"data description:{data.describe().T}" )
if(data.isnull().sum().sum() == 0):
print(f"There are no null values in the provide data" )
else:
print(f"There are {data.isnull().sum().sum()} null value in data")
shape of data: (5000, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 5000 non-null int64
1 Age 5000 non-null int64
2 Experience 5000 non-null int64
3 Income 5000 non-null int64
4 ZIPCode 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal_Loan 5000 non-null int64
10 Securities_Account 5000 non-null int64
11 CD_Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
None
data description: count mean std min 25% \
ID 5000.0 2500.500000 1443.520003 1.0 1250.75
Age 5000.0 45.338400 11.463166 23.0 35.00
Experience 5000.0 20.104600 11.467954 -3.0 10.00
Income 5000.0 73.774200 46.033729 8.0 39.00
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00
Family 5000.0 2.396400 1.147663 1.0 1.00
CCAvg 5000.0 1.937938 1.747659 0.0 0.70
Education 5000.0 1.881000 0.839869 1.0 1.00
Mortgage 5000.0 56.498800 101.713802 0.0 0.00
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00
CD_Account 5000.0 0.060400 0.238250 0.0 0.00
Online 5000.0 0.596800 0.490589 0.0 0.00
CreditCard 5000.0 0.294000 0.455637 0.0 0.00
50% 75% max
ID 2500.5 3750.25 5000.0
Age 45.0 55.00 67.0
Experience 20.0 30.00 43.0
Income 64.0 98.00 224.0
ZIPCode 93437.0 94608.00 96651.0
Family 2.0 3.00 4.0
CCAvg 1.5 2.50 10.0
Education 2.0 3.00 3.0
Mortgage 0.0 101.00 635.0
Personal_Loan 0.0 0.00 1.0
Securities_Account 0.0 0.00 1.0
CD_Account 0.0 0.00 1.0
Online 1.0 1.00 1.0
CreditCard 0.0 1.00 1.0
There are no null values in the provide data
Exploratory Data Analysis.¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- How many customers have credit cards?
- What are the attributes that have a strong correlation with the target attribute (personal loan)?
- How does a customer's interest in purchasing a loan vary with their age?
- How does a customer's interest in purchasing a loan vary with their education?
#1
print(f"*** What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution? ***")
mortgage_dist = df['Mortgage']
plt.figure(figsize=(10,8))
sns.histplot(data=df, x=mortgage_dist)
plt.figure(figsize=(10,8))
sns.boxplot(data=df, x=mortgage_dist)
mortgage_dist_atzero = mortgage_dist[mortgage_dist == 0]
print(f"a. There are {mortgage_dist_atzero.shape[0]} customers that have no mortgage or the mortgage at 0 leading most of the data to be present at the first quartile")
mortgage_dist_outliers = mortgage_dist[mortgage_dist > 250]
print(f"b. There are about {mortgage_dist_outliers.shape[0]} outliers in mortgage out of {mortgage_dist.shape[0]} rows leading to approximately {(mortgage_dist_outliers.shape[0]/mortgage_dist.shape[0]) * 100:.2f}% of outliers in overall data")
print(f"c. Mortgage is right skewed")
*** What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution? *** a. There are 3462 customers that have no mortgage or the mortgage at 0 leading most of the data to be present at the first quartile b. There are about 299 outliers in mortgage out of 5000 rows leading to approximately 5.98% of outliers in overall data c. Mortgage is right skewed
#2
customers = df['ID'].unique().shape[0]
customers
print(f"*** How many customers have credit cards? ***")
print(f"a. All the {df.shape[0]} customers are unique")
customers_with_cc = df[df['CreditCard'] == 1].shape[0]
print(f"b. There are {customers_with_cc} customer with credit card")
*** How many customers have credit cards? *** a. All the 5000 customers are unique b. There are 1470 customer with credit card
#3
print(f"***What are the attributes that have a strong correlation with the target attribute (personal loan)?***")
plt.figure(figsize=(10,8))
sns.heatmap(data=df.corr(),cbar=False,fmt='.2f',cmap='Spectral',vmin=-1,vmax=1,annot=True)
print(f"a. Personal_Loan have strong positive co-relation with Income, CCAvg and CD_Account at {df['Personal_Loan'].corr(df['Income']):.2f},{df['Personal_Loan'].corr(df['CCAvg']):.2f} and {df['Personal_Loan'].corr(df['CD_Account']):.2f} respectively")
***What are the attributes that have a strong correlation with the target attribute (personal loan)?*** a. Personal_Loan have strong positive co-relation with Income, CCAvg and CD_Account at 0.50,0.37 and 0.32 respectively
#4
print(f"***How does a customer's interest in purchasing a loan vary with their age?***")
print(f"a. From the heatmap we could already say that age and personal loan doesn't exhibit a great corelation")
plt.figure(figsize=(10,8))
sns.scatterplot(x=df['Personal_Loan'],y=df['Age'])
print("b. Customers at all ages have interest in buying personal loans")
***How does a customer's interest in purchasing a loan vary with their age?*** a. From the heatmap we could already say that age and personal loan doesn't exhibit a great corelation b. Customers at all ages have interest in buying personal loans
#5
print(f"***How does a customer's interest in purchasing a loan vary with their education?***")
print(f"a. Education vs Personal loan")
for i in range(1,4):
print(f"{i}. Education with value {i}: {(df[(df['Education'] == i) & (df['Personal_Loan'] == 1)]).shape[0]} of {df[df['Education'] == i].shape[0]} have accepted personal loan")
print(f"b. Customers who have Advanced/Professional education are higher probable of accepting the personal loan")
***How does a customer's interest in purchasing a loan vary with their education?*** a. Education vs Personal loan 1. Education with value 1: 93 of 2096 have accepted personal loan 2. Education with value 2: 182 of 1403 have accepted personal loan 3. Education with value 3: 205 of 1501 have accepted personal loan b. Customers who have Advanced/Professional education are higher probable of accepting the personal loan
#6
print(f"***How does a customer's interest in purchasing a loan vary with their Income?***")
pl_vs_income = df[df['Personal_Loan'] == 1]
print(f"a. The lowest income where the customer accepted the loan is {pl_vs_income['Income'].min()}k which indicates that customers with income less than 60k have no interest in accepting personal loans")
print(f"b. Average income of the customer who accepted personal loan is {pl_vs_income['Income'].describe().T['mean']:.2f}k")
# Scatter Plot using Seaborn
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='Income', y='Personal_Loan', hue='Personal_Loan', palette={1: 'blue', 0: 'red'})
# Labels and Title
plt.xlabel('Income')
plt.ylabel('Loan Status')
plt.title('Income vs Loan Status')
plt.show()
***How does a customer's interest in purchasing a loan vary with their Income?*** a. The lowest income where the customer accepted the loan is 60k which indicates that customers with income less than 60k have no interest in accepting personal loans b. Average income of the customer who accepted personal loan is 144.75k
#7
print(f"***How does a customer's interest in purchasing a loan vary with their CCAvg?***")
pl_vs_ccavg = df[df['Personal_Loan'] == 1]
print(f"b. Customers who accepted personal loan have an average credit card amount of {pl_vs_ccavg['CCAvg'].describe().T['mean']:.2f}k")
# Scatter Plot using Seaborn
plt.figure(figsize=(10,8))
sns.scatterplot(data=df, x='CCAvg', y='Personal_Loan', hue='Personal_Loan', palette={1: 'blue', 0: 'red'})
# Labels and Title
plt.xlabel('CCAvg')
plt.ylabel('Loan Status')
plt.title('CCAvg vs Loan Status')
plt.show()
***How does a customer's interest in purchasing a loan vary with their CCAvg?*** b. Customers who accepted personal loan have an average credit card amount of 3.91k
#8 pairplot to identify relations better
plt.figure(figsize=(12,8))
sns.pairplot(df, hue='Personal_Loan',diag_kind='kde')
print(f"Income, Age, Mortatage and CCAvg clearly shows a strong corelation with personal loan acceptance")
Income, Age, Mortatage and CCAvg clearly shows a strong corelation with personal loan acceptance
<Figure size 1200x800 with 0 Axes>
Data Preprocessing¶
- Missing value treatment
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
#Missing value treatment
print(f"There are {df.isnull().sum().sum()} null values in the provided data")
print(f"{df.nunique()}")
#dropping ID column as there there are 5000 unique values with in 5000 rows
df = df.drop(columns=['ID'])
There are 0 null values in the provided data ID 5000 Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 dtype: int64
#Feature engineering
#checking if zipcode or area customers live in affecting the acceptance of loan due to other parameters outside the provided data
print(f"Unique zipcode values: {df['ZIPCode'].nunique()}")
#corelation between zipcode and personal loan
print(f"correaltion between personal loan and ZIPCode{df['Personal_Loan'].corr(df['ZIPCode']):.2f}")
#Dropping the column zipcode as the corelation is too low for this feature to be considered
df = df.drop(columns=['ZIPCode'])
Unique zipcode values: 467 correaltion between personal loan and ZIPCode-0.00
#Column Expereince have high corelation with Age which directly related to the redundancy, dropping the column experience as well
df = df.drop(columns=['Experience'])
#Data prepeation for model
# Define features (X) and target (y)
X = df.drop(columns=['Personal_Loan']) # Drop the target column
y = df['Personal_Loan'] # Target column
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Print the shapes of the resulting splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print('Percentage of classes in train set')
print(100*y_train.value_counts(normalize=True), '\n')
print('Percentage of classes in test set')
print(100*y_test.value_counts(normalize=True), '\n')
print(f"Balance of classes are same in train and test sets")
X_train shape: (4000, 10) X_test shape: (1000, 10) y_train shape: (4000,) y_test shape: (1000,) Percentage of classes in train set Personal_Loan 0 90.4 1 9.6 Name: proportion, dtype: float64 Percentage of classes in test set Personal_Loan 0 90.4 1 9.6 Name: proportion, dtype: float64 Balance of classes are same in train and test sets
Model Evaluation Criterion¶
Model Building¶
Reusable functions
Functions for model performance, confusion matrix, plotting decision tree, plotting text tree
#function for model performance calculation
#input parameter:(model, X_train/X_test, y_prediction_train/y_prediction_test)
def model_performance_classification(model, predictors, target):
"""
Function to calculate all the model performance metrics
model: classifier
predictors: independent variables (X)
target: dependent variable (y)
"""
#prediction using predictors
pred = model.predict(predictors)
model_accuracy = accuracy_score(target, pred)
model_recall = recall_score(target, pred)
model_precision = precision_score(target,pred)
model_f1_score = f1_score(target,pred)
#creating dataframe of all the metrics
df_model_preformance = pd.DataFrame({
'Accuracy': model_accuracy,
'Recall': model_recall,
'Precision': model_precision,
'F1': model_f1_score
},
index=[0])
return df_model_preformance
#function for enabling confusion matrix
#input parameter:(model, X_train/X_test, y_prediction_train/y_prediction_test)
def plot_confusion_matrix(model, predictors, target):
"""
To calculate the confusion matrix with percentages
model: classifier
predictors: independent variables (X)
target: dependent variable (y)
"""
#predicting the target values
y_pred = model.predict(predictors)
#creating the confusion matrix
cm = confusion_matrix(target, y_pred)
#creating labels
labels = np.asarray([
["{0:0.0f}".format(item) + '\n{0:.2%}'.format(item/cm.flatten().sum())]
for item in cm.flatten()
]).reshape(2,2) #to matrix
#figure size for confusion matrix
plt.figure(figsize=(10,8))
sns.heatmap(cm,annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted lable')
#function to plot the decision tree for given model
#input parameter:(model)
from sklearn import tree
def plot_decision_tree(model):
feature_names = list(X_train.columns)
#setting figure size
plt.figure(figsize=(20,20))
#plotting the decision tree
out = tree.plot_tree(model, feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
print(f"Tree Depth: {model.get_depth()}")
print(f"Number of Leaves: {model.get_n_leaves()}")
plt.show()
# function for text report on the model
#input parameter:(model)
def plot_text_tree(model):
print(
tree.export_text(
model,
feature_names=list(X_train.columns),
show_weights=True
)
)
Decision Tree from sklearn model
dtree_model_one | decision tree plot | tree text plot | confusion matrix | performance matrix
#initializing the decision tree and fitting the training data
dtree_model_one = DecisionTreeClassifier(random_state=42)
dtree_model_one.fit(X_train, y_train)
DecisionTreeClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=42)
#Plotting the decision tree for model one
plot_decision_tree(dtree_model_one)
Tree Depth: 13 Number of Leaves: 66
#Plotting the decision tree text for model one
plot_text_tree(dtree_model_one)
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [2845.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 92.50 | | | | |--- Age <= 27.00 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Age > 27.00 | | | | | |--- CCAvg <= 3.65 | | | | | | |--- Mortgage <= 216.50 | | | | | | | |--- Income <= 82.50 | | | | | | | | |--- Age <= 36.50 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- Income <= 63.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | | |--- Income > 63.50 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- Age > 36.50 | | | | | | | | | |--- CCAvg <= 3.35 | | | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 3.35 | | | | | | | | | | |--- Education <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- Education > 1.50 | | | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | |--- Income > 82.50 | | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 3.05 | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | |--- Mortgage <= 94.00 | | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | | | |--- Mortgage > 94.00 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Mortgage > 216.50 | | | | | | | |--- Mortgage <= 249.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- Mortgage > 249.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.65 | | | | | | |--- Mortgage <= 93.50 | | | | | | | |--- weights: [47.00, 0.00] class: 0 | | | | | | |--- Mortgage > 93.50 | | | | | | | |--- Mortgage <= 99.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Mortgage > 99.50 | | | | | | | | |--- weights: [19.00, 0.00] class: 0 | | | |--- Income > 92.50 | | | | |--- Education <= 1.50 | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | |--- Education > 1.50 | | | | | |--- Family <= 2.50 | | | | | | |--- Education <= 2.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- Education > 2.50 | | | | | | | |--- Age <= 31.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 31.00 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | |--- Family > 2.50 | | | | | | |--- weights: [0.00, 4.00] class: 1 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 4.25 | | | | |--- weights: [0.00, 9.00] class: 1 | | | |--- CCAvg > 4.25 | | | | |--- Mortgage <= 38.00 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Mortgage > 38.00 | | | | | |--- weights: [3.00, 0.00] class: 0 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 99.50 | | | | |--- Family <= 1.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Family > 1.50 | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- Income > 99.50 | | | | |--- Income <= 104.50 | | | | | |--- CCAvg <= 3.31 | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.31 | | | | | | |--- CD_Account <= 0.50 | | | | | | | |--- Age <= 33.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 33.00 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- CD_Account > 0.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Income > 104.50 | | | | | |--- weights: [506.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Online <= 0.50 | | | | | |--- CCAvg <= 1.45 | | | | | | |--- Age <= 40.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Age > 40.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- CCAvg > 1.45 | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- Online > 0.50 | | | | | |--- CCAvg <= 0.65 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- CCAvg > 0.65 | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 54.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [36.00, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Age <= 57.50 | | | | | | | |--- Age <= 33.50 | | | | | | | | |--- CCAvg <= 1.70 | | | | | | | | | |--- CCAvg <= 1.55 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 1.55 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- CCAvg > 1.70 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | |--- Age > 33.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- Age <= 36.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- Age > 36.50 | | | | | | | | | | |--- Mortgage <= 231.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- Mortgage > 231.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- Age > 57.50 | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- Family <= 1.50 | | | | | | | |--- Age <= 40.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- Age > 40.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Family > 1.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- CCAvg > 2.95 | | | | |--- CCAvg <= 4.65 | | | | | |--- Mortgage <= 265.50 | | | | | | |--- Age <= 60.00 | | | | | | | |--- Mortgage <= 172.00 | | | | | | | | |--- CCAvg <= 3.70 | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | |--- CCAvg > 3.70 | | | | | | | | | |--- Family <= 2.50 | | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | | |--- Family > 2.50 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- Mortgage > 172.00 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- Age > 60.00 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | |--- Mortgage > 265.50 | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- CCAvg > 4.65 | | | | | |--- weights: [0.00, 7.00] class: 1 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- CCAvg <= 1.10 | | | | | |--- Family <= 1.50 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Family > 1.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- CCAvg > 1.10 | | | | | |--- weights: [0.00, 6.00] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 244.00] class: 1
# Confusion matrix calculation for model one training data
plot_confusion_matrix(dtree_model_one, X_train, y_train)
#Perfomance calculations for model one
dtree_model_one_train_perfromance = model_performance_classification(dtree_model_one, X_train, y_train)
dtree_model_one_train_perfromance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
#plotting confusion matrix for test data
plot_confusion_matrix(dtree_model_one, X_test, y_test)
#performance metrics for model one using test data
dtree_model_one_test_perfromance = model_performance_classification(dtree_model_one,X_test,y_test)
dtree_model_one_test_perfromance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.983 | 0.947917 | 0.883495 | 0.914573 |
Model Performance Improvement¶
Tree depth analysis for pre pruning
#Tree depth analysis
depths = range(1, 14) #As the tree depth is 13 from the above statement
train_scores = []
test_scores = []
for depth in depths:
tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
tree.fit(X_train, y_train)
train_scores.append(accuracy_score(y_train, tree.predict(X_train)))
test_scores.append(accuracy_score(y_test, tree.predict(X_test)))
# Plot train vs. test accuracy
plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, label='Train Accuracy', marker='o')
plt.plot(depths, test_scores, label='Test Accuracy', marker='o')
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Train vs Test Accuracy by Tree Depth')
plt.show()
#Analysis for max_depth, max_leaf_nodes and min_samples
max_depth_values = np.arange(2,13,2) #Excluding root node, and excluding final node
max_leaf_nodes_values = np.arange(10,51,10) #Generalized value
min_sample_split_values = np.arange(10,66,10) #
best_estimator = None
best_score_diff = float('inf')
for max_depth in max_depth_values:
for max_leaf in max_leaf_nodes_values:
for min_sample in min_sample_split_values:
estimator = DecisionTreeClassifier(
max_depth= max_depth,
max_leaf_nodes= max_leaf,
min_samples_split=min_sample,
random_state=42
)
estimator.fit(X_train,y_train)
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
train_f1_score = f1_score(y_train, y_train_pred)
test_f1_score = f1_score(y_test, y_test_pred)
score_diff = abs(train_f1_score - test_f1_score)
if score_diff < best_score_diff:
best_score_diff = score_diff
best_estimator = estimator
Decision Tree Pre prune
dtree_model_two | decision tree plot | tree text plot | confusion matrix | performance matrix
#Determning the best fit model by iterating max_depth, max)leaf_nodes and min_sample_split for training data set
dtree_model_two = best_estimator
dtree_model_two.fit(X_train,y_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=40, min_samples_split=20,
random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=40, min_samples_split=20,
random_state=42)#plotting tree with pre-pruned model
plot_decision_tree(dtree_model_two)
Tree Depth: 10 Number of Leaves: 32
#plotting tree with pre-pruned model
plot_text_tree(dtree_model_two)
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [2845.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 92.50 | | | | |--- Age <= 27.00 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Age > 27.00 | | | | | |--- CCAvg <= 3.65 | | | | | | |--- Mortgage <= 216.50 | | | | | | | |--- Income <= 82.50 | | | | | | | | |--- Age <= 36.50 | | | | | | | | | |--- weights: [7.00, 2.00] class: 0 | | | | | | | | |--- Age > 36.50 | | | | | | | | | |--- CCAvg <= 3.35 | | | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 3.35 | | | | | | | | | | |--- weights: [9.00, 1.00] class: 0 | | | | | | | |--- Income > 82.50 | | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 3.05 | | | | | | | | | |--- weights: [9.00, 6.00] class: 0 | | | | | | |--- Mortgage > 216.50 | | | | | | | |--- weights: [1.00, 2.00] class: 1 | | | | | |--- CCAvg > 3.65 | | | | | | |--- Mortgage <= 93.50 | | | | | | | |--- weights: [47.00, 0.00] class: 0 | | | | | | |--- Mortgage > 93.50 | | | | | | | |--- Mortgage <= 99.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Mortgage > 99.50 | | | | | | | | |--- weights: [19.00, 0.00] class: 0 | | | |--- Income > 92.50 | | | | |--- weights: [9.00, 7.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- weights: [3.00, 10.00] class: 1 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 99.50 | | | | |--- weights: [2.00, 2.00] class: 0 | | | |--- Income > 99.50 | | | | |--- Income <= 104.50 | | | | | |--- CCAvg <= 3.31 | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.31 | | | | | | |--- weights: [3.00, 3.00] class: 0 | | | | |--- Income > 104.50 | | | | | |--- weights: [506.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- weights: [12.00, 6.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 54.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [36.00, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Age <= 57.50 | | | | | | | |--- Age <= 33.50 | | | | | | | | |--- weights: [14.00, 1.00] class: 0 | | | | | | | |--- Age > 33.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- weights: [13.00, 5.00] class: 0 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- Age > 57.50 | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- weights: [2.00, 3.00] class: 1 | | | |--- CCAvg > 2.95 | | | | |--- CCAvg <= 4.65 | | | | | |--- CCAvg <= 3.35 | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | |--- CCAvg > 3.35 | | | | | | |--- CCAvg <= 3.45 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.45 | | | | | | | |--- Family <= 2.50 | | | | | | | | |--- weights: [8.00, 6.00] class: 0 | | | | | | | |--- Family > 2.50 | | | | | | | | |--- weights: [0.00, 7.00] class: 1 | | | | |--- CCAvg > 4.65 | | | | | |--- weights: [0.00, 7.00] class: 1 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- weights: [2.00, 7.00] class: 1 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 244.00] class: 1
#plotting confusion matrix for the newly derived model pre-prune for train data
plot_confusion_matrix(dtree_model_two, X_train, y_train)
#Performance calculation for the model with pre pruning technique using train data
dtree_model_two_train_performance = model_performance_classification(dtree_model_two, X_train, y_train)
dtree_model_two_train_performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98825 | 0.898438 | 0.977337 | 0.936228 |
#plotting confusion matrix for the newly derived model pre-prune for test data
plot_confusion_matrix(dtree_model_two, X_test, y_test)
#Performance calculation for second model with pre pruning technique using test data
dtree_model_two_test_performance = model_performance_classification(dtree_model_two, X_test, y_test)
dtree_model_two_test_performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.989 | 0.9375 | 0.947368 | 0.942408 |
Decision Tree Post prune
Deriving alphas | impurities | node counts | depth
#Applying post pruning technique to reduce the branches in the nodes
model_post_prune = DecisionTreeClassifier(random_state=42)
#cost complexity calculation
path = model_post_prune.cost_complexity_pruning_path(X_train, y_train)
#extracting the alphas
ccp_alphas = abs(path.ccp_alphas)
#extracting the impurities
impurities = path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000233 | 0.000467 |
| 2 | 0.000244 | 0.000954 |
| 3 | 0.000246 | 0.001446 |
| 4 | 0.000300 | 0.002046 |
| 5 | 0.000306 | 0.002965 |
| 6 | 0.000331 | 0.003958 |
| 7 | 0.000333 | 0.004291 |
| 8 | 0.000333 | 0.004624 |
| 9 | 0.000333 | 0.004958 |
| 10 | 0.000350 | 0.006008 |
| 11 | 0.000373 | 0.007499 |
| 12 | 0.000375 | 0.007874 |
| 13 | 0.000381 | 0.008255 |
| 14 | 0.000400 | 0.008655 |
| 15 | 0.000406 | 0.009874 |
| 16 | 0.000419 | 0.012390 |
| 17 | 0.000455 | 0.012845 |
| 18 | 0.000461 | 0.015148 |
| 19 | 0.000493 | 0.016133 |
| 20 | 0.000579 | 0.020187 |
| 21 | 0.000584 | 0.020771 |
| 22 | 0.000779 | 0.021550 |
| 23 | 0.000823 | 0.022373 |
| 24 | 0.000831 | 0.023204 |
| 25 | 0.000870 | 0.024945 |
| 26 | 0.002424 | 0.027369 |
| 27 | 0.002667 | 0.030036 |
| 28 | 0.003000 | 0.033036 |
| 29 | 0.003753 | 0.036789 |
| 30 | 0.020023 | 0.056812 |
| 31 | 0.021549 | 0.078361 |
| 32 | 0.047604 | 0.173568 |
#creating plot alphas
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1],impurities[:-1],marker='o',drawstyle='steps-post')
ax.set_xlabel('Effective alpha')
ax.set_ylabel('Total Impurities of leaves')
ax.set_title('Total Impurities vs Effective Alpha for train set')
Text(0.5, 1.0, 'Total Impurities vs Effective Alpha for train set')
#Creating the array of DecisionTreeClassifiers using each ccp_alpha value derived from post pruning method
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
clf.fit(X_train,y_train)
clfs.append(clf)
#Plotting Alpha vs Node count and Alpha vs Tree depth to identify the growth of tree
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2,1, figsize=(12,8))
ax[0].plot(ccp_alphas,node_counts,marker='o',drawstyle='steps-post')
ax[0].set_xlabel('Alphas')
ax[0].set_ylabel('Node count')
ax[0].set_title('Alpha vs Node Count')
ax[1].plot(ccp_alphas,depth,marker='o',drawstyle='steps-post')
ax[1].set_xlabel('Alphas')
ax[1].set_ylabel('Tree deptht')
ax[1].set_title('Alpha vs Tree depth')
fig.tight_layout()
#Creating the array for f1 scores for each alpha using the train data set
train_f1_scores = []
for clf in clfs:
pred_train = clf.predict(X_train)
f1_train = f1_score(y_train,pred_train)
train_f1_scores.append(f1_train)
#Creating the array for f1 scores for each alpha using the test data set
test_f1_scores = []
for clf in clfs:
pred_test = clf.predict(X_test)
f1_test = f1_score(y_test, pred_test)
test_f1_scores.append(f1_test)
#Plotting Alpha vs f1 scores for both train and test data
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel('Alpha')
ax.set_ylabel('F1 score')
ax.set_title('F1 score vs Alpha for training and test set')
ax.plot(ccp_alphas, train_f1_scores, marker='o', drawstyle='steps-post')
ax.plot(ccp_alphas, test_f1_scores, marker='o', drawstyle='steps-post')
ax.legend()
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
<matplotlib.legend.Legend at 0x32deb01d0>
#Identifying the best post pruning model from the derived test scores
index_best_model = np.argmax(test_f1_scores)
dtree_model_three | decision tree plot | tree text plot | confusion matrix | performance matrix
#Assigning the best model identified to a model variable for further calculations
dtree_model_three = clfs[index_best_model]
dtree_model_three
DecisionTreeClassifier(ccp_alpha=0.0008702884311333967, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0008702884311333967, random_state=42)
#Plotting the decision tree for the new best model post pruned.
plot_decision_tree(dtree_model_three)
Tree Depth: 4 Number of Leaves: 9
#Plotting the decision tree for the new best model post pruned.
plot_text_tree(dtree_model_three)
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [2845.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- weights: [136.00, 21.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- weights: [3.00, 10.00] class: 1 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [531.00, 5.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- weights: [12.00, 6.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 54.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- weights: [75.00, 12.00] class: 0 | | | |--- CCAvg > 2.95 | | | | |--- weights: [12.00, 25.00] class: 1 | | |--- Income > 114.50 | | | |--- weights: [2.00, 251.00] class: 1
#Plotting the confusion matrix for the post pruned best model using training data set
plot_confusion_matrix(dtree_model_three, X_train, y_train)
#Plotting the confusion matrix for the post pruned best model using test data set
plot_confusion_matrix(dtree_model_three, X_test, y_test)
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using train data
dtree_model_three_train_perfromance = model_performance_classification(dtree_model_three, X_train,y_train)
print(dtree_model_three_train_perfromance)
Accuracy Recall Precision F1 0 0.98475 0.885417 0.952381 0.917679
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using test data
dtree_model_three_test_performance = model_performance_classification(dtree_model_three, X_test,y_test)
print(dtree_model_three_test_performance)
Accuracy Recall Precision F1 0 0.991 0.958333 0.948454 0.953368
Decision Tree Post pruning using pre prune model
Deriving alphas | impurities | node counts | depth
#post pruning the pre pruned tree to see for better fit of a model
path_pre_post = dtree_model_two.cost_complexity_pruning_path(X_train, y_train)
#extracting the alphas
ccp_alphas_pre_post = abs(path_pre_post.ccp_alphas)
#extracting the impurities
impurities_pre_post = path_pre_post.impurities
#creating a data frame to display ccp_alphas_pre_post and impurities associated with them in the nodes
pd.DataFrame(path_pre_post)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.015098 |
| 1 | 0.000037 | 0.015135 |
| 2 | 0.000141 | 0.015276 |
| 3 | 0.000214 | 0.015491 |
| 4 | 0.000246 | 0.015983 |
| 5 | 0.000343 | 0.016326 |
| 6 | 0.000364 | 0.016690 |
| 7 | 0.000371 | 0.017432 |
| 8 | 0.000372 | 0.018175 |
| 9 | 0.000429 | 0.019891 |
| 10 | 0.000485 | 0.020376 |
| 11 | 0.000584 | 0.020960 |
| 12 | 0.000585 | 0.023300 |
| 13 | 0.000822 | 0.024945 |
| 14 | 0.002424 | 0.027369 |
| 15 | 0.002667 | 0.030036 |
| 16 | 0.003000 | 0.033036 |
| 17 | 0.003753 | 0.036789 |
| 18 | 0.020023 | 0.056812 |
| 19 | 0.021549 | 0.078361 |
| 20 | 0.047604 | 0.173568 |
#creating plot alphas
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas_pre_post[:-1],impurities_pre_post[:-1],marker='o',drawstyle='steps-post')
ax.set_xlabel('Effective alpha')
ax.set_ylabel('Total Impurities of leaves')
ax.set_title('Total Impurities vs Effective Alpha for train set')
Text(0.5, 1.0, 'Total Impurities vs Effective Alpha for train set')
#Creating an array with alphas that were derived from post pruning using the pre-pruned model
clfs_pre_post = []
for ccp_alpha in ccp_alphas_pre_post:
clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
clf.fit(X_train,y_train)
clfs_pre_post.append(clf)
#Plotting Alpha vs Node count and Alpha vs Tree depth to identify the growth of tree
clfs_pre_post = clfs_pre_post[:-1]
ccp_alphas_pre_post = ccp_alphas_pre_post[:-1]
node_counts_pre_post = [clf_pre_post.tree_.node_count for clf_pre_post in clfs_pre_post]
depth_pre_post = [clf_pre_post.tree_.max_depth for clf_pre_post in clfs_pre_post]
fig, ax = plt.subplots(2,1, figsize=(12,8))
ax[0].plot(ccp_alphas_pre_post,node_counts_pre_post,marker='o',drawstyle='steps-post')
ax[0].set_xlabel('Alphas')
ax[0].set_ylabel('Node count')
ax[0].set_title('Alpha vs Node Count')
ax[1].plot(ccp_alphas_pre_post,depth_pre_post,marker='o',drawstyle='steps-post')
ax[1].set_xlabel('Alphas')
ax[1].set_ylabel('Tree deptht')
ax[1].set_title('Alpha vs Tree depth')
fig.tight_layout()
#Creating the array for f1 scores for each alpha using the train data set
train_f1_scores_pre_post = []
for clf in clfs_pre_post:
pred_train = clf.predict(X_train)
f1_train = f1_score(y_train,pred_train)
train_f1_scores_pre_post.append(f1_train)
#Creating the array for f1 scores for each alpha using the test data set
test_f1_scores_pre_post = []
for clf in clfs_pre_post:
pred_test = clf.predict(X_test)
f1_test = f1_score(y_test, pred_test)
test_f1_scores_pre_post.append(f1_test)
#Plotting Alpha vs f1 scores for both train and test data
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel('Alpha')
ax.set_ylabel('F1 score')
ax.set_title('F1 score vs Alpha for training and test set')
ax.plot(ccp_alphas_pre_post, train_f1_scores_pre_post, marker='o', drawstyle='steps-post')
ax.plot(ccp_alphas_pre_post, test_f1_scores_pre_post, marker='o', drawstyle='steps-post')
ax.legend()
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
<matplotlib.legend.Legend at 0x354db9d60>
#Identifying the best post pruning model from the derived test scores
index_best_model_pre_post = np.argmax(test_f1_scores_pre_post)
dtree_model_four | decision tree plot | tree text plot | confusion matrix | performance matrix
#Assigning the best model identified to a model variable for further calculations
dtree_model_four = clfs[index_best_model_pre_post]
dtree_model_four
DecisionTreeClassifier(ccp_alpha=0.0003999999999999999, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.0003999999999999999, random_state=42)
#Plotting the decision tree for the new best model post pruned.
plot_decision_tree(dtree_model_four)
Tree Depth: 13 Number of Leaves: 39
#Plotting the decision tree for the new best model post pruned.
plot_text_tree(dtree_model_four)
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [2845.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 92.50 | | | | |--- Age <= 27.00 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Age > 27.00 | | | | | |--- CCAvg <= 3.65 | | | | | | |--- Mortgage <= 216.50 | | | | | | | |--- Income <= 82.50 | | | | | | | | |--- weights: [45.00, 3.00] class: 0 | | | | | | | |--- Income > 82.50 | | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 3.05 | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | | |--- weights: [1.00, 2.00] class: 1 | | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | |--- weights: [1.00, 4.00] class: 1 | | | | | | |--- Mortgage > 216.50 | | | | | | | |--- weights: [1.00, 2.00] class: 1 | | | | | |--- CCAvg > 3.65 | | | | | | |--- weights: [66.00, 1.00] class: 0 | | | |--- Income > 92.50 | | | | |--- Education <= 1.50 | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | |--- Education > 1.50 | | | | | |--- weights: [3.00, 7.00] class: 1 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 4.25 | | | | |--- weights: [0.00, 9.00] class: 1 | | | |--- CCAvg > 4.25 | | | | |--- weights: [3.00, 1.00] class: 0 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 99.50 | | | | |--- Family <= 1.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Family > 1.50 | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- Income > 99.50 | | | | |--- weights: [529.00, 3.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Online <= 0.50 | | | | | |--- weights: [2.00, 5.00] class: 1 | | | | |--- Online > 0.50 | | | | | |--- CCAvg <= 0.65 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- CCAvg > 0.65 | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 54.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.95 | | | | |--- Income <= 106.50 | | | | | |--- weights: [36.00, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- CCAvg <= 2.45 | | | | | | |--- Age <= 57.50 | | | | | | | |--- Age <= 33.50 | | | | | | | | |--- weights: [14.00, 1.00] class: 0 | | | | | | | |--- Age > 33.50 | | | | | | | | |--- Family <= 3.50 | | | | | | | | | |--- Age <= 36.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- Age > 36.50 | | | | | | | | | | |--- Mortgage <= 231.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- Mortgage > 231.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Family > 3.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- Age > 57.50 | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.45 | | | | | | |--- weights: [2.00, 3.00] class: 1 | | | |--- CCAvg > 2.95 | | | | |--- CCAvg <= 4.65 | | | | | |--- Mortgage <= 265.50 | | | | | | |--- Age <= 60.00 | | | | | | | |--- Mortgage <= 172.00 | | | | | | | | |--- CCAvg <= 3.70 | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | |--- CCAvg > 3.70 | | | | | | | | | |--- Family <= 2.50 | | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | | |--- Family > 2.50 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- Mortgage > 172.00 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- Age > 60.00 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | |--- Mortgage > 265.50 | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- CCAvg > 4.65 | | | | | |--- weights: [0.00, 7.00] class: 1 | | |--- Income > 114.50 | | | |--- weights: [2.00, 251.00] class: 1
#Plotting the confusion matrix for the post pruned best model using training data set
plot_confusion_matrix(dtree_model_four, X_train, y_train)
#Plotting the confusion matrix for the post pruned best model using test data set
plot_confusion_matrix(dtree_model_four, X_test, y_test)
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using train data
dtree_model_four_train_perfromance = model_performance_classification(dtree_model_four, X_train,y_train)
print(dtree_model_four_train_perfromance)
Accuracy Recall Precision F1 0 0.99475 0.976562 0.968992 0.972763
#Calculating Accuracy, Recall, Precision and F1 scores for post prune model using test data
dtree_model_four_test_performance = model_performance_classification(dtree_model_four, X_test,y_test)
print(dtree_model_four_test_performance)
Accuracy Recall Precision F1 0 0.982 0.947917 0.875 0.91
Model Performance Comparison and Final Model Selection¶
#Comparing the model performances for train data which were derived earlier
models_train_comparison_df = pd.concat(
[
dtree_model_one_train_perfromance.T,
dtree_model_two_train_performance.T,
dtree_model_three_train_perfromance.T,
dtree_model_four_train_perfromance.T
],
axis=1
)
models_train_comparison_df.columns = [
"Normal Decision Tree",
"Prepruned Decision Tree",
"Post pruned Decision Tree",
"Pre Post pruned Decision Tree"
]
models_train_comparison_df
| Normal Decision Tree | Prepruned Decision Tree | Post pruned Decision Tree | Pre Post pruned Decision Tree | |
|---|---|---|---|---|
| Accuracy | 1.0 | 0.988250 | 0.984750 | 0.994750 |
| Recall | 1.0 | 0.898438 | 0.885417 | 0.976562 |
| Precision | 1.0 | 0.977337 | 0.952381 | 0.968992 |
| F1 | 1.0 | 0.936228 | 0.917679 | 0.972763 |
#Comparing the model performances for test data which were derived earlier
models_test_comparison_df = pd.concat(
[
dtree_model_one_test_perfromance.T,
dtree_model_two_test_performance.T,
dtree_model_three_test_performance.T,
dtree_model_four_test_performance.T
],
axis=1
)
models_test_comparison_df.columns = [
"Normal Decision Tree",
"Prepruned Decision Tree",
"Post pruned Decision Tree",
"Pre Post pruned Decision Tree"
]
models_test_comparison_df
| Normal Decision Tree | Prepruned Decision Tree | Post pruned Decision Tree | Pre Post pruned Decision Tree | |
|---|---|---|---|---|
| Accuracy | 0.983000 | 0.989000 | 0.991000 | 0.982000 |
| Recall | 0.947917 | 0.937500 | 0.958333 | 0.947917 |
| Precision | 0.883495 | 0.947368 | 0.948454 | 0.875000 |
| F1 | 0.914573 | 0.942408 | 0.953368 | 0.910000 |
scores_diff = models_test_comparison_df - models_train_comparison_df
scores_diff
| Normal Decision Tree | Prepruned Decision Tree | Post pruned Decision Tree | Pre Post pruned Decision Tree | |
|---|---|---|---|---|
| Accuracy | -0.017000 | 0.000750 | 0.006250 | -0.012750 |
| Recall | -0.052083 | 0.039062 | 0.072917 | -0.028646 |
| Precision | -0.116505 | -0.029969 | -0.003927 | -0.093992 |
| F1 | -0.085427 | 0.006180 | 0.035689 | -0.062763 |
From the score difference above we can clearly conclude that post pruned decision tree have the best scores in both accuracy and F1 compared to other models
Actionable Insights and Business Recommendations¶
Technical conclusions from model
- Customers with an income greater than 98.5 (thousand dollars) are more likely to accept personal loans.
- Customers with higher credit card spending averages (CCAvg > 2.95) are more likely to take personal loans.
- Customers with a CD account (CD_Account > 0.5) and high credit card spending (CCAvg > 2.95) are more likely to accept loans (10 out of 13 instances in the corresponding node are positive).
- Customers with a family size greater than 2.5 who have moderate income (98.5 < Income <= 113.5) are less likely to accept loans (12 out of 18 instances are in the negative class). However, customers with a family size greater than 2.5 and higher income (Income > 113.5) are significantly more likely to accept loans (54 out of 54 instances are positive).
- Customers with advanced education (Education > 1.5) are more likely to take personal loans compared to those with lower education levels.
- Customers with a combination of high income, high credit card spending, and CD accounts are the strongest candidates for accepting loans.
- Customers with low income (Income <= 98.5) and low credit card spending (CCAvg <= 2.95) are highly unlikely to accept personal loans.
- What recommedations would you suggest to the bank?
- Target high-income customers in the bank’s marketing campaigns for personal loans. • Prioritize customers with incomes exceeding 114.5, as they are predominantly in the positive class of accepting personal loan.
- Market personal loans specifically to CD account holders with high credit card spending. • Bundle personal loan offers with CD promotions or discounts to encourage loan acceptance.
- For families larger than 2.5, prioritize targeting those with higher income brackets. • Target customers with graduate or advanced degrees with marketing messages highlighting financial literacy, investment opportunities, and tailored loan products. Consider advertising for professional development or higher eduction.